-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VReplication: Improve handling of vplayer stalls #15797
VReplication: Improve handling of vplayer stalls #15797
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
fe9ed6d
to
03a5b03
Compare
Signed-off-by: Matt Lord <[email protected]>
03a5b03
to
bee30d4
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15797 +/- ##
=======================================
Coverage ? 68.56%
=======================================
Files ? 1544
Lines ? 197779
Branches ? 0
=======================================
Hits ? 135608
Misses ? 62171
Partials ? 0 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
cb5fc2e
to
b78059d
Compare
Signed-off-by: Matt Lord <[email protected]>
b78059d
to
d4517f2
Compare
…eout Signed-off-by: Matt Lord <[email protected]>
3ebfc1d
to
c666b14
Compare
@rohit-nayak-ps and @shlomi-noach this removes all of the What we lose w/o it is the ability to perform out-of-band monitoring and errors. Meaning that the heartbeat method will only detect a stall when it was due to a failure to commit the transaction which updates the timestamp for the workflow (whether it was done on its own or as part of replicating user generated events). If you both prefer that then I'll update the PR description accordingly. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattlord thank you, I think we should go with this change, seeing that it's so succinct.
Good news is that I can repeat the exact scenario and symptoms seen in production with the Bad news is that neither of the two methods added here are generating the helpful error message. I'm still getting:
So I need to keep digging... |
Signed-off-by: Matt Lord <[email protected]>
@rohit-nayak-ps and @shlomi-noach OK, now that I'm able to repeat the exact issue seen in production (which kicked off this work) — I was missing a detail in that test case which I only realized this week — I know that I've now fixed it with the latest commit. Before that commit, even with the
It's quite some time before that error surfaces and when in the state leading up to it you cannot even With the latest commit, however (
And you can What the PR now does is detect stalls on both "ends" in the vplayer:
I will now update the PR description as well. I will also add the |
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
It says this in the description:
What causes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch!
@rohit-nayak-ps I'm sorry, I was wrong on the For the
That's here: vitess/go/vt/vttablet/tabletmanager/vreplication/controller.go Lines 322 to 326 in 8c97857
Which is waiting for That goroutine in turn is blocked on this one:
Which is here at line 293 (line numbers are a little off as this was done in my PR branch): vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go Lines 289 to 294 in 8c97857
And the goroutine for that function:
Which is trying and retrying to commit the last relay log batch, cycling through queries like this:
Each of which are doing a table scan so take a few seconds. So it looks like we process the stop request and cancel the controller's context, which then causes the |
…eout Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
go/flags/endtoend/vtcombo.txt
Outdated
@@ -404,6 +404,7 @@ Flags: | |||
--v Level log level for V logs | |||
-v, --version print binary version | |||
--vmodule vModuleFlag comma-separated list of pattern=N settings for file-filtered logging | |||
--vplayer-progress-deadline duration At what point, without having been able to successfully replicate a pending batch of events, should we consider the vplayer stalled; producing an error and log message and restarting the player. (default 5m0s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to be a flag?
I ask because I would like us to be really mindful about introducing new flags. We have too many already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest of the implementation LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't have to be a flag. I don't yet have clear guidance on when you might want to change the default value, only the feeling that such a scenario may come up and being able to use a different value would then be helpful. I could make the 5m deadline static...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make it static for now. I was also unable to imagine when you would want to change it.
@mattlord please go ahead and merge this after removing the flag. |
Signed-off-by: Matt Lord <[email protected]>
…eout Signed-off-by: Matt Lord <[email protected]>
Description
Please see the issue for details about the problems we're attempting to solve/improve in this PR.
The improvements here are about detecting when we're not making any progress in the vplayer (running/replicating phase of a vreplication workflow) and showing/logging meaningful errors to replace the eventual generic EOF errors seen before this PR:
It's quite some time before that error surfaces when in this stalled state and when in the state leading up to this error you cannot even
stop
the workflow either as things have backed up so badly (thestop
command hangs for some time before eventually completing, see here).With this PR, you would now by default get this helpful error instead every 5 minutes (in the vreplication workflow message field, in the
_vt.vreplication_log
table, and in the vttablet logs):And you can
stop
the workflow during this time as we're not letting things back up so badly.What the PR now does is detect stalls on both "ends" in the vplayer:
Related Issue(s)
Checklist